Resource allocation in partial fault tolerant applications

ABSTRACT

A method for allocating a set of components of an application to a set of resource groups includes the following steps performed by a computer system. The set of resource groups is ordered based on respective failure measures and resource capacities associated with the resource groups. An importance value is assigned to each of the components. The importance value is associated with an effect of the component on an output of the application. The components are assigned to the resource groups based on the importance value of each component and the respective failure measures and resource capacities associated with the resource groups. The components with higher importance values are assigned to resource groups with lower failure measures and higher resource capacities. The application may be a partial fault tolerant (PFT) application that comprises PFT application components. The resource groups may comprise a heterogeneous set of resource groups (or clusters).

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of U.S. application Ser. No. 11/970,841, filed on Jan. 8, 2008, the disclosure of which is incorporated herein by reference.

This invention was made with Government support under Contract No.: H98230-07-C-0383 awarded by the Department of Defense. The Government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention generally relates to distributed data processing systems and, more particularly, to techniques for allocating computing resources to partial fault tolerant applications in such distributed data processing systems.

BACKGROUND OF THE INVENTION

Distributed data processing systems need to be highly available and robust to failures. Traditional approaches to fault-tolerance employ techniques such as replication or check-pointing to address the availability requirements. However, these approaches introduce well-known tradeoffs between cost and availability. For example, a replicated service may incur significant overheads to meet strict consistency requirements. Further, the monetary cost of implementing highly available services can double for just a fraction of a percent of availability, and under correlated failures, even additional replicas result in strongly diminishing returns in availability improvement for many replication schemes. Similarly, the overheads of check-pointing can limit its benefits.

Many distributed data processing systems (often operating under limited computing resources) have the property that they can continue operating and producing useful output even in the presence of application component failures, though the output quality may be of a reduced value. We refer to these applications herein as Partial Fault Tolerant (PFT) applications. In contrast to applications that require the availability of all components to operate correctly, PFT applications provide a “graceful degradation” in performance as the number of failures increases. For example, aggregation systems such as MapReduce (see, e.g., J. Dean et al., “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004) based Sawzall (see, e.g., R. Pike et al., “Interpreting the Data: Parallel Analysis with Sawzall,” Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 2005), SDIMS (see, e.g., P. Yalagandula et al., “A Scalable Distributed Information Management System,” SIGCOMM, 2004), and PIER (see, e.g., R. Huebsch et al., “Querying the Internet with Pier,” VLDB, 2003) are likely to be able to tolerate some missing objects while processing a query (e.g., AVG, JOIN, etc.) on a distributed database. Similarly, data mining applications such as WTTW (see, e.g., Verscheure et al., “Finding ‘Who is Talking to Whom’ in VoIP Networks Via Progressive Stream Clustering,” ICDM, 2006) and FAB (see, e.g., Turaga et al., “Online FDC Control Limit Tuning with Yield Prediction Using Incremental Decision Tree Learning,” Sematech AEC/APC Symposium XIX, 2007) can still classify data objects under failures, though with less confidence. Further, for many stream processing applications with stringent temporal requirements (see, e.g., D. J. Abadi et al., “The Design of the Borealis Stream Processing Engine,” CIDR, 2005), it is more important to produce partial results within a given time bound than full results delivered late. Finally, mission-critical applications deploy multiple sensors at different physical locations such that at least some of them should trigger an alert during failures or when operating conditions are violated (e.g., fire, medical emergencies, etc.).

However, none of the above fault-tolerance approaches adequately addresses (in terms of minimizing cost and maximizing availability) the assignment of PFT application components or, more generally, the allocation of computing resources in a distributed computing system, where the computing resources have certain failure characteristics and may be heterogeneous in nature.

SUMMARY OF THE INVENTION

Principles of the invention provide new techniques for assignment of PFT application components or, more generally, the allocation of computing resources in a distributed computing system.

For example, in one aspect of the invention, a method for allocating a set of one or more processing components of an application to a set of one or more resource groups comprises the following steps performed by a computer system. The set of one or more resource groups is ordered based on respective failure measures and resource capacities associated with the one or more resource groups. An importance value is assigned to each of the one or more processing components, wherein the importance value is associated with an effect of the processing component on the application output. The one or more processing components are assigned to the one or more resource groups based on the importance value of each processing component and the respective failure measures and resource capacities associated with the one or more resource groups, wherein processing components with higher importance values are assigned to resource groups with lower failure measures and higher resource capacities.

The application may be a partial fault tolerant (PFT) application that comprises a set of one or more PFT application components. The set of one or more resource groups may comprise a heterogeneous set of resource groups (or clusters of machines).

The ordering step may comprise sorting the one or more resource groups in a decreasing order. The step of sorting may be based on a ratio of a respective resource capacity of each of the one or more resource groups to a failure probability of each of the one or more resource groups. Alternatively, the step of sorting may be based on a product of a respective resource capacity of each of the one or more resource groups and an availability measure of each of the one or more resource groups. The availability measure for a given resource group may be computed as one minus the failure probability of the given resource group.

An importance value may be based on a contribution that the processing component makes to the application output. Alternatively, an importance value may be based on a loss incurred in the application output value if the resource hosting the given processing component fails.

The allocating step may also be based on one or more specified constraints on the one or more components.

The allocating step may determine an order for assigning components to a set of resource groups, based on a data flow graph associated with the application, such that a single resource group failure affects a minimal number of paths from a source (where computation on a data item is initiated) to a sink (where the final output is produced) in the data flow graph.

The allocating step may be performed after a failure of at least one of the components or resource groups (thus, it may also be considered a run-time reallocation).

These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a data aggregation system, according to one embodiment of the invention.

FIG. 2 illustrates three possible allocations of three processing components to two resource groups (clusters) for the data aggregation system in FIG. 1.

FIGS. 3A and 3B illustrate a methodology for allocating components of a PFT application running on distributed data processing systems, in accordance with one embodiment of the invention.

FIG. 4 illustrates a computing system in which methodologies of the invention may be implemented, according to one embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Illustrative principles of the invention address a key problem of how to assign PFT application components to a distributed computing system comprising a set of heterogeneous resource groups (also referred to herein as “clusters”), with different resource capacities and availabilities, under a correlated failure model. Specifically, a method for placement of processing components for PFT applications is provided that prevents, delays, or minimizes the “loss” in the expected application output value under failures before a full recovery from failures takes effect.

By way of example only, an application component may be defined as a set of software modules which perform various operations on input data elements in order to generate output data elements. Examples of input data elements include packets of audio data, email data, computer generated events, network data packets, or readings from sensors, such as environmental, medical or process sensors. Examples of transformations conducted by individual application components include parsing the header of an IP packet, aggregating audio samples into an audio segment or performing speech detection on an audio segment, sampling sensor readings, averaging or joining the readings over a time window of samples, applying spatial, temporal, or frequency filters to extract specific signatures over the audio or video segments, etc. The application components are composed into an application represented as a data-flow graph. A large number of such applications can tolerate partial failures and are therefore PFT applications.

The method determines the assignment of PFT application components to clusters such that the loss in the output value of the PFT applications is minimized under failures.

The method incorporates the following in computing the resource allocation: (i) a mathematical model of cluster failures where each cluster is assigned a failure probability under a correlated failure model, and where individual cluster failures are considered independent; (ii) the resource capacities of clusters; and (iii) the availability and the placement constraints provided by the applications.

The component allocation method includes the following steps.

1. First, the computing clusters are ordered (sorted in decreasing order) based on the ratio of their resource capacity to the failure probability. Alternatively, the ordering may be done based on the product of resource capacity and (1 - failure probability), the latter factor also being referred to herein as availability.

2. Second, each application component is assigned a relative “importance value” (scalar value) defined as its contribution to the application output. Alternatively, this importance value is the “loss” incurred in the application's total output value if the resource hosting that component fails.

3. Third, the component allocation method uses both (a) the importance metric to rank application components and (b) the sorted order of clusters so that highly important components get assigned to highly reliable computing clusters with high resource capacities.
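The first two steps above can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the `Cluster` type, the function names, and the sample capacities and failure probabilities are all hypothetical, and failure probabilities are assumed to lie strictly between 0 and 1 so that the ratio is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    capacity: float      # resource capacity c_j (units are an assumption)
    failure_prob: float  # failure probability p_j, assumed 0 < p_j < 1


def order_by_ratio(clusters):
    """Sort clusters in decreasing order of capacity / failure probability."""
    return sorted(clusters, key=lambda t: t.capacity / t.failure_prob, reverse=True)


def order_by_availability(clusters):
    """Sort clusters in decreasing order of capacity * (1 - failure probability)."""
    return sorted(clusters, key=lambda t: t.capacity * (1.0 - t.failure_prob), reverse=True)


def rank_components(importance):
    """Return component names from highest to lowest importance value."""
    return sorted(importance, key=importance.get, reverse=True)


# Hypothetical sample data, chosen so that the two orderings differ.
clusters = [
    Cluster("T1", capacity=10, failure_prob=0.20),
    Cluster("T2", capacity=4, failure_prob=0.01),
    Cluster("T3", capacity=6, failure_prob=0.05),
]
print([t.name for t in order_by_ratio(clusters)])
print([t.name for t in order_by_availability(clusters)])
print(rank_components({"C1": 1.0, "C2": 0.5, "C3": 0.5}))
```

With these sample values the ratio ordering favors the small but very reliable cluster T2, while the product ordering favors the large cluster T1, which shows that the choice between the two keys can change the placement.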

The method may also include the step of allocating application components based on their specified constraints on resources (such as the need to be allocated to a cell blade or to a secure tamper-resistant node, etc.), while still addressing the goal of minimizing the loss in the application output value under failures.

The method determines an order for assigning components based on the application data flow graph such that a single cluster failure affects the minimal number of paths from a source to a sink.

The method aims to minimize the total weighted “loss” in the expected application output value for a plurality of applications when these applications execute on and share access to the same set of computing clusters. Further, the method may also include factors such as processing component reuse and input data reuse across a plurality of applications, relative priorities of applications in terms of ordering their expected output value, fault-tolerant characteristics of individual applications, and delay constraints on output response by an application, etc.

The method is also applied when a failure occurs in a PFT application, to reallocate the failed components to the available resource clusters.

Advantageously, the inventive method provides for component placement, wherein both resource capacities and failure probabilities are used to assign application components to computing clusters. Prior work (see U.S. Patent Application Ser. No. 11/735,026 (Attorney Docket No. YOR920060857US1), “System and Method for Dependent-Failure Aware Allocation of Distributed Data-Processing Systems,” filed Apr. 13, 2007, the disclosure of which is incorporated by reference herein) only uses resource capacities but not failure probabilities. As a result, the technique used in prior work might allocate all application components to the cluster with the largest capacity but having the smallest availability, thereby significantly reducing the availability of the application hosted on the distributed data processing system.

By way of further advantage, components are allocated in decreasing importance to clusters by defining a connected sub-graph comprising components that are all co-located on the same cluster. This allocation has the advantage of limiting the effect of a cluster's failure to the minimal number of paths from a source to a sink. Prior work may assign to the same cluster components that do not necessarily form a connected sub-graph. Therefore, a single cluster failure can affect many more paths under the prior work's technique, a drawback that the above method for assigning processing components addresses.

Still further, the inventive method is applied during failure recovery. When a subset of the application components has failed, this method can be applied to restore the failed components to the available resources, thereby improving the application output value.
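By way of illustration only, the failure-recovery use of the method might be sketched as below. The function name, the data layout, and the assumption that every component consumes one unit of capacity are hypothetical. The sketch also ignores the connected sub-graph preference for brevity and simply places failed components, in decreasing importance, on the surviving clusters ranked by capacity over failure probability.

```python
def reallocate_after_failure(assignment, failed_cluster, importance, clusters):
    """Re-place the components of a failed cluster on the surviving clusters.

    assignment: dict component -> cluster name (placement before the failure)
    clusters:   dict cluster name -> (capacity, failure_prob)
    Assumption: each component uses one unit of capacity.
    Returns (new_assignment, unplaced_components).
    """
    survivors = {n: cp for n, cp in clusters.items() if n != failed_cluster}

    # Spare capacity left on each survivor after its current components.
    spare = {n: cap for n, (cap, _) in survivors.items()}
    for host in assignment.values():
        if host in spare:
            spare[host] -= 1

    # Survivors in decreasing order of capacity / failure probability.
    ranked = sorted(survivors, key=lambda n: survivors[n][0] / survivors[n][1],
                    reverse=True)

    # Failed components, most important first (name breaks ties deterministically).
    failed = sorted((c for c, h in assignment.items() if h == failed_cluster),
                    key=lambda c: (-importance[c], c))

    new_assignment = {c: h for c, h in assignment.items() if h != failed_cluster}
    unplaced = []
    for comp in failed:
        target = next((n for n in ranked if spare[n] >= 1), None)
        if target is None:
            unplaced.append(comp)  # no surviving capacity for this component
            continue
        new_assignment[comp] = target
        spare[target] -= 1
    return new_assignment, unplaced


# Hypothetical example: the gray cluster fails while hosting 11-1 and 11-2.
before = {"11-1": "gray", "11-2": "gray", "11-3": "black"}
clusters = {"gray": (3, 0.05), "black": (2, 0.10), "white": (1, 0.20)}
importance = {"11-1": 2, "11-2": 1, "11-3": 1}
after, unplaced = reallocate_after_failure(before, "gray", importance, clusters)
print(after, unplaced)
```

In this example the more important component 11-1 takes the last free slot on the more reliable surviving cluster, and 11-2 falls back to the remaining one.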

While certain illustrative embodiments of the invention will be described herein from the perspective of data stream applications, it is to be understood that the principles of the invention are not limited to use with any particular application or any data processing system. Rather, principles of the invention are more generally applicable to any application and any data processing system in which it would be desirable to minimize the effect of failures on the application output quality.

Assuming a distributed data processing system model, the problem can be precisely stated as follows. Given a distributed computing system comprising n clusters (T₁, T₂, . . . , T_n), each with a resource capacity c_i and a failure probability p_i (i ranging from 1 to n), and a PFT application made up of m components (C₁, C₂, . . . , C_m), each of which may execute on any cluster, allocate each of the m components to one of the n clusters such that the loss in expected application output value is minimized under failures subject to the constraints imposed by the application data flow graph, the resource capacities, and the failure probabilities.

Thus, to overcome the above-mentioned drawback in distributed data processing systems (i.e., in the event of a failure-oblivious allocation of application components to computing clusters, even a single cluster failure can have a significant impact on the application's output quality if its highly important components were placed on that cluster), principles of the invention employ a “failure aware” design concept. Such a failure aware design concept provides the differentiation between clusters that are highly available and clusters that are most likely to fail, and uses this information to make assignment decisions of processing components to resource clusters.

FIG. 1 shows a data aggregation system according to one embodiment of the invention. As shown, the illustrative data aggregation system includes a plurality of components (11), wherein each of components 11-2 and 11-3 receives the data inputs for aggregation. The components forward the inputs (k_p and k_q) to the component 11-1, which computes the aggregate result (SUM in this case).

It is to be appreciated that such components may be logically allocated portions of processing resources (virtual machines) within one computing system, such as a mainframe computer. Alternatively, they could be allocated one or more types of computing devices, e.g., server, personal computer, laptop computer, handheld computing devices, etc. However, principles of the invention are not limited to any particular type of computing device or computing architecture. While the illustrative embodiment shows only three nodes, it is to be appreciated that the system can include more than three nodes.

FIG. 2 illustrates three possible component allocations of three components to two clusters for the data aggregation system in FIG. 1: (a) assign root component 11-1 to one cluster (black shaded cluster or “cluster 1”) and components 11-2 and 11-3 to another cluster (gray shaded cluster or “cluster 2”), (b) assign 11-1 and 11-3 to the gray cluster and 11-2 to the black cluster, and (c) assign all of 11-1, 11-2, and 11-3 to the gray cluster.

Note that allocation (b) is better than allocation (a) because if the black cluster fails, then the application output for allocation (a) goes to 0. On the other hand, under allocation (b), the system could still process data flowing from 11-3 to 11-1. If the gray cluster fails, both allocations give no output. A careful calculation shows that the best allocation, however, is (c), which keeps all components on the same cluster. The main intuition behind this is that only one cluster failure scenario affects allocation (c), while two cluster failure scenarios can hinder allocations (a) and (b).
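The comparison of allocations (a), (b), and (c) can be checked numerically with the short Python sketch below. It rests on assumptions that are not stated in the text: the two clusters fail independently, each with a hypothetical probability of 0.1, and the output value is taken to be the number of source-to-sink paths whose components are all on working clusters.

```python
from itertools import product

# Source-to-sink paths of the FIG. 1 aggregation tree: each leaf feeds root 11-1.
PATHS = [("11-2", "11-1"), ("11-3", "11-1")]


def expected_output(allocation, failure_prob):
    """Expected number of live source-to-sink paths under independent failures.

    allocation:   dict component -> cluster name
    failure_prob: dict cluster name -> failure probability
    """
    names = sorted(failure_prob)
    total = 0.0
    # Enumerate every up/down combination of the clusters.
    for states in product([True, False], repeat=len(names)):
        up = dict(zip(names, states))
        prob = 1.0
        for name in names:
            prob *= (1.0 - failure_prob[name]) if up[name] else failure_prob[name]
        # A path is live only if every component on it sits on a working cluster.
        live = sum(all(up[allocation[c]] for c in path) for path in PATHS)
        total += prob * live
    return total


p = {"black": 0.1, "gray": 0.1}  # hypothetical failure probabilities
alloc_a = {"11-1": "black", "11-2": "gray", "11-3": "gray"}
alloc_b = {"11-1": "gray", "11-2": "black", "11-3": "gray"}
alloc_c = {"11-1": "gray", "11-2": "gray", "11-3": "gray"}

for label, alloc in (("a", alloc_a), ("b", alloc_b), ("c", alloc_c)):
    print(label, round(expected_output(alloc, p), 2))
```

Under these assumptions the expected values are approximately 1.62 for (a), 1.71 for (b), and 1.80 for (c), which agrees with the ordering described above. In closed form, with a common failure probability p, the three values are 2(1 - p)², (1 - p)(2 - p), and 2(1 - p).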

There are several important observations from this example. First, we observe that it is preferable to allocate as many components as possible to the same cluster (subject to cluster resource constraints) to minimize the loss in the expected output value under failures. Second, we observe that it is preferable to assign components on independent paths to different clusters to avoid dependent failures. Finally, for heterogeneous clusters with different failure probabilities, we observe that it is preferable to assign “highly important” components to clusters with the lowest failure probabilities. We use these observations in designing a component placement algorithm to be described below.

These observations suggest three guiding principles: (1) components of higher importance should be placed on clusters with highest capacities and lowest failure probabilities; (2) all components lying on a path from a source to the sink should be co-located on the same cluster (if possible), i.e., minimize the total number of clusters on all paths; and (3) assign components on independent paths to different clusters to avoid dependent failures.

The method of component allocation defines a connected sub-graph of processing components that are all allocated to the same resource cluster. The practical advantage of this method is that it minimizes the number of paths affected by a single cluster failure.
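To make the notion of "affected paths" concrete, the following Python sketch enumerates source-to-sink paths in a small acyclic data flow graph and counts how many of them touch a given cluster. The graph, the component names, and the two placements are hypothetical and serve only to contrast a connected sub-graph placement with a scattered one.

```python
def source_to_sink_paths(edges, sources, sink):
    """Enumerate all paths from any source to the sink in an acyclic graph."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if node == sink:
            paths.append(prefix)
            return
        for nxt in successors.get(node, []):
            walk(nxt, prefix)

    for s in sources:
        walk(s, [])
    return paths


def paths_affected_by(cluster, allocation, paths):
    """Count paths with at least one component hosted on the given cluster."""
    return sum(any(allocation[c] == cluster for c in path) for path in paths)


# Hypothetical graph: two independent chains S1->A and S2->B merging at sink K.
edges = [("S1", "A"), ("A", "K"), ("S2", "B"), ("B", "K")]
paths = source_to_sink_paths(edges, sources=["S1", "S2"], sink="K")

# Each cluster hosts a connected sub-graph of the data flow graph.
connected = {"S1": "T1", "A": "T1", "S2": "T2", "B": "T2", "K": "T2"}
# Components of both chains are interleaved across the two clusters.
scattered = {"S1": "T1", "A": "T2", "S2": "T1", "B": "T2", "K": "T2"}

for label, alloc in (("connected", connected), ("scattered", scattered)):
    print(label, {t: paths_affected_by(t, alloc, paths) for t in ("T1", "T2")})
```

In the connected placement a failure of T1 disturbs only one of the two paths, whereas in the scattered placement a failure of either cluster disturbs both.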

FIGS. 3A and 3B illustrate a flow diagram showing a method for allocating components of a PFT application running on a distributed data processing system in accordance with one embodiment of the invention.

In general, the steps of FIGS. 3A and 3B correspond to the following pseudo-code, which describes a fault-aware component placement algorithm. Thus, reference will be made below to the steps of FIGS. 3A and 3B that correspond to the pseudo-code.

Algorithm 300 starts (301) by inputting (302) a set C of all PFT application components, a set T of all clusters, and the application data flow graph G(C, E). The algorithm proceeds as follows:

1: Calculate the importance I(C) for components C = {C₁, C₂, . . . , C_m} (303).
2: Rank the clusters T₁, T₂, . . . , T_n sorted (decreasing) on c_j/p_j (j ranging from 1 to n) (303).
3: j := 1 (303)
4: while set C is not empty do (304)
5:     Select the highest importance component C_i from C (305)
6:     while T_j has spare capacity do (306)
7:         Assign C_i to T_j; remove C_i from set C; initialize set SG to {C_i} (307 and 308)
8:         Select highest importance C_k from C such that C_k is connected to SG by an edge in E (as described below) (309)
9:         if C_k satisfying (8:) AND T_j has spare capacity then (310)
10:            Assign C_k to T_j; remove C_k from set C; add {C_k} to SG (311 and 312)
11:        else {no such C_k exists OR T_j has no spare capacity}
12:            break;
13:        end if
14:    end while
15:    if T_j has no spare capacity then (306)
16:        j := j + 1 (313)
17:    end if
18: end while
19: stop (314)
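One possible reading of the pseudo-code above is sketched in Python below. It is an interpretation under stated assumptions, not a definitive implementation: every component is assumed to consume one unit of cluster capacity, edges are treated as undirected when testing whether a component is connected to the sub-graph SG, each seed component C_i is assigned once before its sub-graph is grown, ties in importance are broken by component name, and an error is raised if the clusters run out of capacity. The function name and data layout are hypothetical.

```python
def allocate(importance, edges, clusters):
    """Greedy fault-aware placement following the structure of Algorithm 300.

    importance: dict component -> importance value
    edges:      iterable of (u, v) pairs of the data flow graph
    clusters:   list of (name, capacity, failure_prob) tuples
    Assumption: each component uses one unit of capacity.
    """
    # Undirected adjacency, used to test "connected to SG by an edge in E".
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Step 2: rank clusters in decreasing order of c_j / p_j.
    ranked = sorted(clusters, key=lambda t: t[1] / t[2], reverse=True)
    spare = {name: cap for name, cap, _ in ranked}

    def most_important(candidates):
        # Highest importance first; the name breaks ties deterministically.
        return min(candidates, key=lambda c: (-importance[c], c))

    remaining = set(importance)
    assignment = {}
    j = 0
    while remaining:                      # step 4
        if j >= len(ranked):
            raise ValueError("not enough cluster capacity for all components")
        cluster = ranked[j][0]
        if spare[cluster] < 1:            # steps 15-16: move to the next cluster
            j += 1
            continue
        seed = most_important(remaining)  # step 5
        assignment[seed] = cluster        # step 7
        remaining.discard(seed)
        spare[cluster] -= 1
        subgraph = {seed}
        while spare[cluster] >= 1:        # steps 8-10: grow the connected sub-graph
            frontier = [c for c in remaining if neighbors.get(c, set()) & subgraph]
            if not frontier:
                break                     # steps 11-12: no connected candidate
            nxt = most_important(frontier)
            assignment[nxt] = cluster
            remaining.discard(nxt)
            spare[cluster] -= 1
            subgraph.add(nxt)
    return assignment


# Hypothetical inputs modeled on the aggregation tree of FIG. 1.
importance = {"11-1": 2, "11-2": 1, "11-3": 1}
edges = [("11-2", "11-1"), ("11-3", "11-1")]
clusters = [("black", 2, 0.10), ("gray", 2, 0.05)]
print(allocate(importance, edges, clusters))
```

With these inputs the root 11-1 and one leaf are placed on the cluster with the higher capacity-to-failure ratio, and the remaining leaf spills over to the other cluster once the first one is full.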

Thus, in more general terms, given an application data flow graph G(C, E), the method for component assignment includes the following step: allocate components in decreasing importance to clusters ranked by c_j/p_j (j ranging from 1 to n). The method may further define a connected sub-graph SG of components that are co-located on the same cluster (say T) as follows: at each step, assign the highest importance C_k if: (1) T has spare capacity; and (2) C_k is connected to SG by an edge in E, i.e., there is an edge from C_k to C_p and C_p belongs to the sub-graph SG.

The method for component assignment may perform the step of allocating components in decreasing importance to clusters ranked by c_j * (1 - p_j) (j ranging from 1 to n), where 1 - p_j is also termed the availability of a cluster.

Further, ties between clusters having an equal ratio of c_j/p_j or an equal product c_j * (1 - p_j) can either be arbitrarily broken, or based on comparing p_j values against a threshold and selecting the cluster with the smaller p_j value, or based on comparing c_j values against a threshold and selecting the cluster with the higher c_j value, or based on selecting the cluster with the smaller p_j value if both the clusters satisfy a minimum threshold of c_j, or based on selecting the cluster with the higher c_j value if both the clusters satisfy a maximum threshold of p_j, or any combination of these schemes and other techniques.
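Two of the simpler tie-breaking schemes mentioned above can be sketched in Python as a custom comparator. The sketch is illustrative only: the function names and sample clusters are hypothetical, the threshold-based variants are not shown, and scores are compared with a small floating-point tolerance, which is an implementation choice rather than something the text prescribes.

```python
import math
from functools import cmp_to_key


def ratio_score(cluster):
    """Score a (name, capacity, failure_prob) tuple by c_j / p_j."""
    _, capacity, failure_prob = cluster
    return capacity / failure_prob


def make_comparator(score, prefer_lower_failure=True):
    """Build a comparator: higher score first, then the chosen tie-break."""
    def compare(a, b):
        sa, sb = score(a), score(b)
        if not math.isclose(sa, sb, rel_tol=1e-9):
            return -1 if sa > sb else 1
        _, ca, pa = a
        _, cb, pb = b
        if prefer_lower_failure:
            return (pa > pb) - (pa < pb)  # smaller p_j sorts first
        return (cb > ca) - (cb < ca)      # larger c_j sorts first
    return compare


# Hypothetical clusters: T1 and T2 tie on c_j / p_j (both equal 100).
clusters = [("T1", 10, 0.10), ("T2", 5, 0.05), ("T3", 8, 0.02)]

by_failure = sorted(clusters, key=cmp_to_key(make_comparator(ratio_score, True)))
by_capacity = sorted(clusters, key=cmp_to_key(make_comparator(ratio_score, False)))
print([t[0] for t in by_failure])
print([t[0] for t in by_capacity])
```

Here T3 always ranks first because of its higher ratio; the tie between T1 and T2 resolves to T2 when the smaller failure probability is preferred and to T1 when the higher capacity is preferred.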

Embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that may include, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable storage medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk read-only memory (CD-ROM), compact disk read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code such as the computing system 400 shown in FIG. 4 may include at least one processor 402 coupled directly or indirectly to memory element(s) 404 through a system bus 410. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O device(s) 406 (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapter(s) 408 may be included to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters. It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices. Thus, software components including instructions or code for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A method for allocating a set of one or more processing components of an application to a set of one or more resource groups, comprising the steps performed by a computer system of: ordering the set of one or more resource groups based on respective failure measures and resource capacities associated with the one or more resource groups; assigning an importance value to each of the one or more components, wherein the importance value is associated with an effect of the component on an output of the application; and assigning the one or more components to the one or more resource groups based on the importance value of each component and the respective failure measures and resource capacities associated with the one or more resource groups, wherein components with higher importance values are assigned to resource groups with lower failure measures and higher resource capacities.
 2. The method of claim 1, wherein the application is a partial fault tolerant (PFT) application that comprises a set of one or more PFT application components.
 3. The method of claim 1, wherein the set of one or more resource groups comprises a heterogeneous set of resource groups.
 4. The method of claim 1, wherein the ordering step comprises sorting the one or more resource groups in a decreasing order based on a ratio of a respective resource capacity of each of the one or more resource groups to a failure probability of each of the one or more resource groups.
 5. The method of claim 1, wherein the ordering step comprises sorting the one or more resource groups in a decreasing order based on a product of a respective resource capacity of each of the one or more resource groups and an availability measure of each of the one or more resource groups.
 6. The method of claim 5, wherein the availability measure for a given resource group is computed as one minus a failure probability of the given resource group.
 7. The method of claim 1, wherein the importance value assigned to a given component is based on a contribution that the given component makes to the application output.
 8. The method of claim 1, wherein the importance value assigned to a given component is based on a loss incurred in the application output value if the resource hosting the given component fails.
 9. The method of claim 1, wherein the step of assigning the one or more components to the one or more resource groups is also based on one or more specified constraints on the one or more components.
 10. The method of claim 1, wherein an order for assigning components is determined based on a data flow graph associated with the application such that a single resource group failure affects the minimal number of paths from a source to a sink in the data flow graph.
 11. The method of claim 1, wherein the step of assigning the one or more components to the one or more resource groups is performed responsive to a failure of at least one of the resources, making unavailable at least one of the components assigned thereto.
 12. The method of claim 1, wherein the effect of a given component on the output of the application comprises an effect of the given component on an output quality of the application.
 13. The method of claim 12, wherein the effect of a given component on the application output quality is based on the given component being in one or more paths of a data flow graph associated with the application.
 14. The method of claim 1, wherein the step of assigning the one or more components to the one or more resource groups comprises defining, within a data flow graph associated with the application, a connected sub-graph of components assigned to a given resource group.
 15. An article of manufacture for allocating a set of one or more components of an application to a set of one or more resource groups, the article comprising a non-transitory computer readable storage medium containing one or more programs, which when executed by a computer implement the steps of claim 1.
 16. Apparatus for allocating a set of one or more components of an application to a set of one or more resource groups, comprising: a memory; and at least one processor coupled to the memory and operative to perform the steps of: ordering the set of one or more resource groups based on respective failure measures and resource capacities associated with the one or more resource groups; assigning an importance value to each of the one or more components, wherein the importance value is associated with an effect of the component on an output of the application; and assigning the one or more components to the one or more resource groups based on the importance value of each component and the respective failure measures and resource capacities associated with the one or more resource groups, wherein components with higher importance values are assigned to resource groups with lower failure measures and higher resource capacities.
 17. The apparatus of claim 16, wherein the application is a partial fault tolerant (PFT) application that comprises a set of one or more PFT application components.
 18. The apparatus of claim 16, wherein the ordering step comprises sorting the one or more resource groups in a decreasing order based on a ratio of a respective resource capacity of each of the one or more resource groups to a failure probability of each of the one or more resource groups.
 19. The apparatus of claim 16, wherein the ordering step comprises sorting the one or more resource groups in a decreasing order based on a product of a respective resource capacity of each of the one or more resource groups and an availability measure of each of the one or more resource groups.
 20. The apparatus of claim 16, wherein the importance value assigned to a given component is based on a contribution that the given component makes to the application output.
 21. The apparatus of claim 16, wherein the importance value assigned to a given component is based on a loss incurred in the application output value if the resource hosting the given component fails. 